MultiThreading Benchmarks

Roy Longbottom

Contents


General
Whetstone Benchmarks
Assembler Code Add Benchmark
BusSpeed Benchmark
RandMem Benchmarks
MP MFLOPS Benchmarks
OpenMP and QPAR Benchmarks



General

These benchmarks execute the same code as the original versions, which were designed to exercise a single CPU, but implement multithreading to use up to all available cores. Some employ a single manual threading method, where there might be more suitable options, while others use OpenMP and QPAR to generate parallelism automatically. In most cases, 32 bit and 64 bit compilations are provided for Windows and Linux.

This report includes detailed results for a quad core, eight thread 3.9 GHz Intel Core i7 CPU, and provides links to others covering various different CPUs.

Whetstone Benchmark - This is mainly dependent on floating point speed, but with some independently timed integer test functions. Each thread executes shared code using independent variables, held mainly in L1 cache, so performance is proportional to the number of cores, or higher with hyperthreading.

Assembly Code Arithmetic - These tests execute integer and SSE floating point add instructions via independent threads. On the i7, using four cores, they demonstrate up to 61.5 GFLOPS (maximum specification 62.4) or 12.3 integer MIPS per MHz.

BusSpeed MP Benchmark - This provides read only access to data in caches and RAM. It is intended to demonstrate bus operation and speed where data is transferred in bursts and maximum data transfer speed. In the original Windows version, each thread read all the data, starting at the same point. This had to be modified for Linux, due to the excessive impact of caching. Cache based tests demonstrate up to 62 GB/second per core. RAM speed is 16 GB/second using 1 thread and up to 40 GB/second via 8 threads, 78% of maximum specification.

RandMem MP Benchmark - The program uses the same code for serial and random access via a complex indexing structure and comprises Read and Read/Write tests, covering data from caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points. Serial reading is slower than BusSpeed MP but with similar multithreading gains. Random reading speed can be reduced due to burst reading over buses from caches and particularly RAM, but benefits from multithreading. Read/Write tests produce the worst performance characteristics, where single thread operation can be faster than using multiple threads, particularly with random access.

MP MFLOPS Benchmark - The benchmark carries out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, via caches and RAM. Each thread deals with separate segments of the data, via shared code, fully demonstrating multithreading speed gains. Performance is highly dependent on the ability of the compiler available at production time, in this case using old i387, SSE or AVX1 instructions, and on whether full SIMD is implemented for the latter. For the i7 based PC, quad core single precision GFLOPS are shown to be up to 24 with i87 or SISD, 94 with SIMD and 177 with AVX1. The calculations make use of linked multiply and add instructions, with a maximum of 249.6 GFLOPS for AVX1.

OpenMP and QPAR MFLOPS Benchmarks - The benchmarks carry out the same calculations as MP MFLOPS Benchmark, essentially using the same code, without any OpenMP code requirements, but with critical loops preceded by a simple “go parallel” directive. QPAR is a Microsoft alternative to OpenMP. With the results shown, OpenMP maximum speeds of 24 GFLOPS are demonstrated via Windows and 91 GFLOPS using QPAR. Linux results show 23 GFLOPS for 32 bit compilations, with 50 GFLOPS at 64 bits, improving to 94 GFLOPS with an AVX compile option.





Whetstone MP Benchmark

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Further details and performance of early systems can be found in Whetstone Benchmark History and Results.

The overall performance rating was later upgraded to Millions of Whetstone Instructions Per Second (MWIPS from KWIPS) and the speed of the eight different test functions provided, in terms of Millions of Operations Per Second (MOPS) or MFLOPS for floating point calculations. Three PC multithreading versions are available, with results for all being included in Whetstone Benchmark Detailed Later Results. All come in 32 bit and 64 bit versions. The benchmarks effectively run independent threads, possibly demonstrating the best multithreading performance. Full samples of logged performance details are provided below. They are all for the same 3.9 GHz Core i7 CPU and demonstrate variability produced by different compilations.

For the first versions, source code is in newsource.zip and further details are in dualcore.htm. As indicated, these are dual core benchmarks. They use independent code and data.

Later Windows Versions - whets8thread32.exe, whets8Thread64.exe and source code can be found in quadcore.zip. Further details are included in quad core 8 thread.htm. In this case, code is shared between threads, but each has its own data. The benchmarks run 1, 2, 4, 6 and 8 threads. The results below are for the quad core i7 processor with hyperthreading, which provides significant additional performance gains.

Linux Versions - whetsMP32, whetsMP32DP, whetsMP64 and whetsMP64DP can be downloaded in linux_multithreading_apps.tar.gz, along with source code. Further details can be found in linux multithreading benchmarks.htm. This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64), with the default being the number of configured CPUs identified in the gathered system information.
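The multithreading gains in the 32 bit Windows log below can be checked with a little arithmetic. The following is only an illustrative sketch: the MWIPS totals are copied from that log, while the dictionary and function names are hypothetical and not part of any benchmark.

```python
# MWIPS totals copied from the 32 bit Windows 1 to 8 thread log in this section.
mwips_by_threads = {1: 3672, 2: 7343, 4: 14635, 6: 20696, 8: 26596}


def speedup(threads: int) -> float:
    """Total MWIPS at this thread count relative to the one thread run."""
    return mwips_by_threads[threads] / mwips_by_threads[1]


for threads in sorted(mwips_by_threads):
    print(f"{threads} threads: {speedup(threads):.2f} x one thread")

# Gain from using all eight hardware threads instead of one per core.
hyperthreading_gain = mwips_by_threads[8] / mwips_by_threads[4]
print(f"8 threads over 4 threads: {hyperthreading_gain:.2f} x")
```

On these figures, four threads scale almost linearly and eight threads add a further gain on the four cores, which is the hyperthreading effect referred to above.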

                       Windows 2 Threads - 64 Bit Version

 Whetstone Single Precision MP SSE Benchmark Fri Jul 30 15:51:13 2010

 Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

   1829  32308   8142   2157   1967   1442    263    105   5331   2904  17427
  Thread 1              1078    983    765    132   52.1   2653   1450  16339
  Thread 2              1079    984    677    131   52.6   2678   1454   1088

 ############################################################################

                   Windows 1 to 8 Threads - 32 Bit Version

 Whetstone Single Precision 8 Thread Benchmark Mon May 12 10:17:45 2014

 Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

   1055  15341   3672   1243   1059    893   82.7   50.4   3502   5566   1482
             Thread 1   1243   1059    893   82.7   50.4   3502   5566   1482

   2112  30669   7343   2486   2120   1788    165    101   7004  11121   2963
             Thread 1   1243   1060    894   82.7   50.4   3502   5566   1481
             Thread 2   1243   1060    893   82.6   50.4   3502   5555   1481

   4217  60844  14635   4970   4239   3560    330    201  13986  22017   5852
             Thread 1   1243   1060    891   82.6   50.3   3503   5566   1470
             Thread 2   1241   1059    888   82.2   50.2   3489   5564   1464
             Thread 3   1243   1060    890   82.6   50.2   3492   5449   1458
             Thread 4   1243   1060    891   82.5   50.2   3502   5439   1459

   6316  72487  20696   7434   6357   5333    459    288  19319  23188   6802
             Thread 1   1239   1060    888   76.5   47.4   3237   3412   1159
             Thread 2   1238   1058    888   77.1   47.4   3188   3894   1122
             Thread 3   1240   1060    888   77.7   47.9   3206   3234   1157
             Thread 4   1239   1059    890   75.4   48.6   3248   4508   1214
             Thread 5   1240   1060    891   76.4   48.8   3227   4486   1061
             Thread 6   1238   1060    888   76.4   48.0   3213   3654   1090

   8406  80481  26596   9893   8473   7085    590    375  24845  22260   7541
             Thread 1   1237   1059    886   73.8   46.9   3108   2782    943
             Thread 2   1236   1058    886   73.7   46.9   3099   2782    943
             Thread 3   1237   1059    883   73.7   46.9   3104   2782    942
             Thread 4   1238   1060    886   73.8   46.9   3106   2783    943
             Thread 5   1238   1060    885   73.8   46.9   3110   2783    943
             Thread 6   1237   1059    886   73.7   46.9   3103   2782    942
             Thread 7   1233   1059    886   73.7   46.9   3108   2782    942
             Thread 8   1236   1060    887   73.7   46.9   3108   2783    943


                            Linux Results Next
 
 ############################################################################

                   Linux Up To 64 Threads - 2 and 8 shown

  Multithreading Single Precision Whetstones 32-Bit Version 1.0
  Using 2 threads - Sat Nov 8 14:49:17 2014

          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread               1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1      3653   1330   1329    938     94     42   4600   5853    932
    2      3660   1330   1329    938     95     42   4600   5850    936

 Total     7312   2660   2658   1877    189     85   9200  11703   1868
 MWIPS     7305  Based on time for last thread to finish

  Multithreading Single Precision Whetstones 32-Bit Version 1.0
  Using 8 threads - Sat Nov 8 14:50:13 2014

          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread               1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1      3084   1324   1323    938     72     39   3200   2931    591
    2      3088   1324   1326    938     72     39   3337   2855    592
    3      3072   1322   1322    928     72     39   3175   3043    591
    4      3073   1310   1280    933     72     39   3233   2884    591
    5      3076   1317   1302    929     72     39   3389   2966    591
    6      3074   1302   1303    933     72     39   3099   2911    592
    7      3076   1319   1302    933     72     39   3202   2913    590
    8      3068   1294   1258    936     72     39   3261   2920    590

 Total    24612  10513  10417   7468    577    312  25897  23424   4728
 MWIPS    24463  Based on time for last thread to finish


CPU Speed Via Assembly Language Add Instructions

The benchmarks use an integer test and a floating point test. They are first executed separately, then together in two or more threads, with speeds measured in integer MIPS or MFLOPS. Below are example log files for a quad core 8 thread Core i7 CPU. Results are available in PC CPUID 1994 to 2013, plus Measured Maximum Speeds Via Assembler Code.pdf. The separate tests indicate three integer MIPS per MHz and (nearly) the expected maximum of four SSE floating point adds per clock cycle. They also show significantly higher throughput via eight threads than via four.
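The per MHz figures quoted above follow directly from the logged speeds. The sketch below is an illustration only: the measured values are copied from the Windows and Linux logs in this section, the 3900 MHz clock is the one quoted for this Core i7, and all variable names are hypothetical.

```python
CPU_MHZ = 3900   # clock speed quoted for the Core i7 in this report
SSE_LANES = 4    # 32 bit words held in one 128 bit SSE register
CORES = 4

# One thread results, "Separate Tests" in the 32 bit Windows log.
sse_mflops_one_thread = 15461
int_mips_one_thread = 12291

# Eight thread aggregates from the 64 bit Linux log.
linux_sse_aggregate = 61449
linux_int_aggregate = 47924

adds_per_clock = sse_mflops_one_thread / CPU_MHZ   # close to 4
int_per_mhz = int_mips_one_thread / CPU_MHZ        # just over 3

# One SSE add instruction per clock on every core.
peak_sse_mflops = CPU_MHZ * SSE_LANES * CORES      # 62400, i.e. 62.4 GFLOPS
share_of_peak = linux_sse_aggregate / peak_sse_mflops
quad_int_per_mhz = linux_int_aggregate / CPU_MHZ   # about 12.3

print(f"SSE adds per clock, one thread: {adds_per_clock:.2f}")
print(f"Integer MIPS per MHz, one thread: {int_per_mhz:.2f}")
print(f"Eight thread SSE share of peak: {share_of_peak:.1%}")
print(f"Eight thread integer MIPS per MHz: {quad_int_per_mhz:.1f}")
```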

First Windows Versions (obsolete) - cpuidmp.exe and cpuidMP64.exe are included in dualcore.zip with source code in newsource.zip. This covered 1, 2 and 4 threads.

Later Windows Versions - cpuid8thread32.exe, cpuid8Thread64.exe and source code can be found in quadcore.zip. Further details and results are included in quad core 8 thread.htm. Test functions measure performance using 1, 2, 4, 6 and 8 threads.

Linux Versions - cpumaxmp32 and cpumaxmp64, with source code, are in linux_multithreading_apps.tar.gz. The benchmarks can have an input parameter for 1, 2, 4, 8, 16, 32 or 64 threads (example command ./cpumaxmp32 Threads 8), the default being the identified count, such as 8 for a quad core CPU with hyperthreading. Further details and results can be found in linux multithreading benchmarks.htm. This variety has separate tests for integer and floating point calculations at the designated thread count.

         Windows 1 to 8 Threads - 32 Bit Version

   CPU ID MP 8 Thread Test 32 bit Version 1.0 Sat May 10 12:11:41 2014

  Speed adding to registers   Pass 1   Pass 2   Pass 3

  Separate Tests
  32 bit SSE   MFLOPS         15458    15461    15461
  32 bit Integer MIPS         12291    12291    12291

  Two Threads Equal Priority
  32 bit SSE   MFLOPS         15460    15460    15461
  32 bit Integer MIPS         12290    12292    12292

  Four Threads, First Normal Priority, Others Normal - 1
  32 bit SSE   MFLOPS         15425    15455    15457
  32 bit SSE   MFLOPS         15449    15455    15449
  32 bit Integer MIPS         12273    12190    12283
  32 bit Integer MIPS         11866    12194    12290

  Total  SSE   MFLOPS         30874    30910    30906
  Total  Integer MIPS         24139    24384    24573

  Eight Threads, All Normal Priority
  32 bit SSE   MFLOPS         13237     9434    11840
  32 bit SSE   MFLOPS         13747     9695    13896
  32 bit SSE   MFLOPS          8731    11788    11824
  32 bit SSE   MFLOPS          9154    15443    13920
  32 bit Integer MIPS          6171     7072     6624
  32 bit Integer MIPS          6353     6054     6802
  32 bit Integer MIPS          6743     6983     6604
  32 bit Integer MIPS          6809     6239     6833

  Total  SSE   MFLOPS         44869    46360    51480
  Total  Integer MIPS         26076    26348    26863

 ############################################################################

     Linux Multithreading Add Test 64 bit Version 1.0 Fri Oct 20 16:53:05 2017    
        
 Integer Additions 8 Threads                    SSE Floating Point Additions 8 Threads
        
 Thread   4 -    6350 64 bit Integer MIPS        Thread   3 -    7773 32 Bit SSE MFLOPS
 Thread   2 -    6196 64 bit Integer MIPS        Thread   8 -    7763 32 Bit SSE MFLOPS
 Thread   7 -    6181 64 bit Integer MIPS        Thread   5 -    7755 32 Bit SSE MFLOPS
 Thread   3 -    6169 64 bit Integer MIPS        Thread   2 -    7752 32 Bit SSE MFLOPS
 Thread   8 -    6145 64 bit Integer MIPS        Thread   4 -    7742 32 Bit SSE MFLOPS
 Thread   5 -    6077 64 bit Integer MIPS        Thread   7 -    7737 32 Bit SSE MFLOPS
 Thread   6 -    6047 64 bit Integer MIPS        Thread   1 -    7726 32 Bit SSE MFLOPS
 Thread   1 -    5990 64 bit Integer MIPS        Thread   6 -    7681 32 Bit SSE MFLOPS
 Total      -   49155 64 Bit Integer MIPS        Total      -   61929 32 Bit SSE MFLOPS
 Aggregate  -   47924 64 Bit Integer MIPS        Aggregate  -   61449 32 Bit SSE MFLOPS
 
                        Aggregate based on last to finish      

 Tot 4 Threads  33748                                           61549  



BusSpeed MP Benchmark

This version uses integer AND instructions to a single register, streaming data from caches or RAM. The first test reads one word with a 32 word address increment for the next word, that is 128 bytes with 32 bit words and 256 bytes with 64 bit words. The address increment reduces for the following tests, down to one word (ReadAll), all via C. The last test reads all the data as 16 byte SSE2 words, using assembly code. The benchmark is intended to demonstrate bus operation and speed where data is transferred in bursts and maximum data transfer speed. On the latest systems, multiple programs or threads are clearly needed for maximum throughput.
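The access pattern described above can be sketched as follows. This is a simplified illustration in Python, not the benchmark's C or assembly code: the function names are hypothetical, and the 6 KB array is chosen only to match the smallest size in the logs.

```python
def stride_bytes(inc_words, word_bytes):
    """Address step between successive reads, in bytes."""
    return inc_words * word_bytes


def and_read(data, inc_words):
    """AND every inc_words-th word into one accumulator, like a single register."""
    acc = 0xFFFFFFFF
    words = 0
    for i in range(0, len(data), inc_words):
        acc &= data[i]
        words += 1
    return acc, words


data = [0xFFFFFFFF] * 1536            # 6 KB of 32 bit words
for inc in (32, 16, 8, 4, 2, 1):      # Inc32wds down to ReadAll
    acc, words = and_read(data, inc)
    print(f"inc {inc:2d} words: step {stride_bytes(inc, 4):3d} bytes, "
          f"{words} words read")
```

With a 32 word increment only one word is used from each 128 byte (or 256 byte) step, so the reported MB/second reflects how much of each burst transfer is wasted.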

First Windows Versions (obsolete) - busmp.exe, busMP64.exe and busMP64Int32.exe are included in dualcore.zip with source code in newsource.zip and results in busspd2k results.htm.

Later Windows Versions (1 to 8 threads) - bus8thread32.exe and bus8thread64.exe are also in quadcore.zip, with further details included in quad core 8 thread.htm. The following 64 bit example results include some for 32 bit tests, where SSE2 functions are from the same code, but 32 bit words are used for the integer tests, instead of 64 bits. For ReadAll cache based tests, CPU speed (MIPS) tends to be the same, with double data transfer speeds at 64 bits. Then, using RAM, bus and memory speeds become the limiting factors.

Linux Versions - MPbusspeed32, MPbusspeed64, MPbusspeed32V2 and MPbusspeed64V2 can be found in linux_multithreading_apps.tar.gz. They have the same run time format as the above Linux benchmarks for up to 64 threads. Further details can be found in linux multithreading benchmarks.htm. See the Linux Results comments that follow the Windows results.
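The point made above for the later Windows versions, that ReadAll cache speeds in words per second are similar at 32 and 64 bits while MB/second doubles, can be checked from the one thread, 24 KB row of the Windows log below. The names in this sketch are hypothetical and only the two MB/second values come from the log.

```python
# ReadAll MB/second, 1 thread, 24 KB row of the 64 bit Windows log,
# for the 64 bit integer test and the 32 bit comparison column.
readall_mb_per_sec = {64: 41751, 32: 21157}


def million_words_per_sec(bits):
    """Convert MB/second to millions of words per second for this word size."""
    return readall_mb_per_sec[bits] / (bits // 8)


ratio = readall_mb_per_sec[64] / readall_mb_per_sec[32]

for bits in (32, 64):
    print(f"{bits} bit words: {million_words_per_sec(bits):.0f} million words/second")
print(f"64 bit over 32 bit MB/second: {ratio:.2f}")
```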

                    Windows 1 to 8 Threads - 64 Bit Version

     MP Bus Speed Test 64 bit Version 2.0 Sat May 10 11:57:03 2014
          
                     Part 1 - 1 Thread MBytes/Second                        32 bit

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2 ReadAll 128bSSE2

        6    31565    31291    31178    42042    42508    41978    61606   21375   61610
       24    31300    31285    31258    42203    42786    41751    62331   21157   62329
       96     5375     5559     5793    11083    20009    34332    40516   20363   40673
      384     5562     5658     5864    11338    19966    33244    39317   20679   38413
      768     5331     5391     5505    10966    19403    32805    37871   20680   38718
     1536     5364     5427     5508    10779    19355    33166    37951   20679   38331
    16380     1070     1356     1955     4248     8046    16688    16838   14103   16757
   131070     1034     1272     1866     4023     7724    16029    15980   13852   15963

                    Part 2 - 2 Thread MBytes/Second

        6    63147    62371    62552    83983    85074    83689   123233   42597  123206
       24    62579    62580    62188    84353    85351    83515   124250   42252  124188
       96    10779    10875    11473    21904    39332    67550    80624   40717   80597
      384    10088    11391    11560    22649    39705    67022    78033   41352   76206
      768    10574    10610    11042    21889    38669    65967    76066   41356   77275
     1536    10442    10637    10901    21597    38467    66046    75829   41353   76302
    16380     1798     2305     3397     7161    13913    28647    28743   25980   28471
   131070     1780     2310     3424     7193    13808    28589    28617   26066   28578

                    Part 3 - 4 Thread MBytes/Second

        6   116410   124710    92330   167023   148833   165596   245603   70155  238644
       24   124722   124658    96440   143956   153894   165793   248402   67455  225894
       96    21213    21636    20486    39631    73042   115995   159914   74935  123866
      384    21720    22354    22996    44788    79335   111720   155599   76795  128063
      768    18098    19577    21168    41296    71833   128568   126837   75598  138878
     1536    13887    19117    20564    37334    73001   126388   143677   74219  129958
    16380     2113     2780     4682     9428    18500    36759    37534   36126   37098
   131070     2109     2598     4681     8806    18112    37049    37477   36384   35472

                    Part 4 - 6 Thread MBytes/Second

        6   118438   106222   105201   161860   157529   178558   295443   88245  309920
       24    89228    71127    80985   110402   127049   167495   216617   87712  228035
       96    17634    19432    18990    38043    68990   111843   134485   83460  143356
      384    18645    18932    19929    42970    76220   123858   138682   83237  142146
      768    18072    17529    19655    40544    65312   124557   132566   79248  141036
     1536    14363    16097    18084    35815    59434   104533   128989   73640  123287
    16380     2043     2763     4568     9273    18501    36749    36798   36852   36663
   131070     2082     2689     4508     9093    18033    35246    36318   36347   36784

                    Part 5 - 8 Thread MBytes/Second

        6   124479   125263   124774   196833   206725   212245   392939  107411  402166
       24    53893    57161    59948    89256   129520   173683   263250  100645  259380
       96    21217    21589    22492    44013    84359   147831   165343   98906  164050
      384    21016    21622    22726    43780    80221   147095   165442   98539  161937
      768    19382    20258    21737    42635    80814   144343   159745   98558  160982
     1536     9986    10664    12858    24661    49622    83158    93985   60140   92112
    16380     2074     2748     4525     9123    18245    36548    36486   36504   36414
   131070     2072     2759     4525     9123    18216    36571    36445   36481   36443

                                   Linux Results Next 



Linux Results

Windows versions, and initial Linux programs, arranged for all threads to start by reading data from the beginning. This did not appear to raise any issues via Windows but it clearly did so using Linux. This became particularly noticeable on later CPUs, such as the Core i7 reported on here, with a 10 MB shared L3 cache. Maximum memory data transfer speed of this PC is 51.2 GB/second.

The first results below are for Version 1, single thread, 64 bit and 32 bit, with performance similar to the Windows versions, that is faster integer MB/second via caches at 64 bits. The other results are for 64 bit Version 2, where performance is quite similar to the Windows (Version 1) speeds. (Ignore the 6 KB speeds, which need a longer test.) The last two columns are for Linux Version 1 results, where RAM speeds are shown to be faster than the 51.2 GB/second specification, due to caching effects. In Version 2, each thread reads all the data but at staggered starting points and additional RAM is read, to prove the point. Now a maximum of 40.6 GB/second is shown, at 4 threads, 2.2 times faster than that with one thread.
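The Version 2 change, where each thread still reads all the data but from a staggered starting point, might be sketched as below. This is an assumption-laden illustration: the report does not state how the starting points are spaced, so equal spacing is assumed here, and the function names are hypothetical. Python threads will not reproduce the speed gains, so the sketch only shows the access pattern.

```python
import threading


def starting_points(n_words, threads):
    """Equally spaced start offsets, one per thread (spacing is an assumption)."""
    return [t * n_words // threads for t in range(threads)]


def read_all_from(data, start, results, slot):
    """AND every word into one accumulator, beginning at start and wrapping round."""
    acc = 0xFFFFFFFF
    n = len(data)
    for k in range(n):
        acc &= data[(start + k) % n]
    results[slot] = acc


def run_staggered(data, threads):
    """Run one reader per thread; each reads all the data from its own offset."""
    results = [None] * threads
    workers = []
    for slot, start in enumerate(starting_points(len(data), threads)):
        worker = threading.Thread(target=read_all_from,
                                  args=(data, start, results, slot))
        workers.append(worker)
        worker.start()
    for worker in workers:
        worker.join()
    return results


data = [0xFFFFFFFF] * 1024
data[5] = 0x0F0F0F0F
print(starting_points(len(data), 4))
print([hex(r) for r in run_staggered(data, 4)])
```

Because every thread still covers the whole array, each produces the same result, but at any moment the threads are working on different parts of the data rather than all re-reading the same recently cached words.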

  MP Bus Speeds 64 bit Version 1.0, 1 Threads, Sun Oct 22 14:03:08 2017      32 bit

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2 ReadAll 128bSSE2

        6    31168    31240    31189    42408    43371    43670    61517   20443   61544
       24    31267    31251    31217    42139    43348    43816    62254   20787   62259
       96    13627    14374    15240    24228    32977    40497    60299   20459   60286
      384     5556     5707     5797    11305    20134    34224    39990   20366   41534
      768     5348     5442     5555    10923    19356    33585    38201   20385   38255
     1536     5311     5421     5555    10924    19385    33698    38255   20421   38362
    16380     1240     1564     2130     4671     9149    18280    18843   16515   19109
   131070     1201     1469     2098     4573     8500    18137    18128   16472   17808
   393210     1155     1453     2098     4557     8112    18145    17813   15913   18024


  MP Bus Speeds 64 bit Version 2.0, 1 Threads, Sun Oct 22 14:06:58 2017     Version 1

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2 Read All 128bSSE2

        6    31594    31270    31258    41133    36625    41267    61563   43670   61517
       24    31283    31252    31211    42440    38461    42184    62258   43816   62254
       96    14896    15334    15560    24390    32204    39245    60395   40497   60299
      384     5703     5835     5988    11721    20542    34338    40726   34224   39990
      768     5389     5468     5563    10924    19334    33585    38159   33585   38201
     1536     5365     5453     5564    10925    19339    33598    38187   33698   38255
    16380     1285     1562     2179     4795     8882    18631    19165   18280   18843
   131070     1225     1453     2096     4460     8528    18195    18187   18137   18128
   393210     1225     1454     2077     4477     8703    18188    18059   18145   17813
   786420     1230     1450     2051     4576     8571    17598    18190
  1572840     1216     1486     2086     4583     8647    18109    18427

2 Threads
        6    29512    30019    59608    60056    69436    80271   102281   84136  123044
       24    59225    59487    58806    83693    75177    83373   124495   86728  121640
       96    20250    21156    21937    38565    59794    76975   120371   80333  121121
      384    10653    10963    11272    21556    38987    59334    80732   65431   82328
      768    10087    10384    10637    19731    36985    63797    75626   63587   76116
     1536    10103    10435    10729    20807    37071    63898    76338   63838   76340
    16380     2628     3222     4158     8358    15989    32486    33558   32248   33552
   131070     1968     2585     3803     8004    15471    31863    32579   32166   33354
   393210     1969     2594     3825     7570    15511    31911    32714   32125   33558
   786420     1966     2592     3722     7989    15429    32025    32676
  1572840     1970     2593     3839     8112    15467    32103    32767

4 Threads
        6    25920    29754    58965    64123    95935   147826   260224  167038  205273
       24   114028   118093   119688   117904   114405   163844   244665  173073  243044
       96    42412    42912    43013    75571   119669   154540   240629  160370  241160
      384    20903    21781    22653    42992    77420   128661   163280  127537  159648
      768    19201    19029    20653    39706    72719   117327   151481  125191  151515
     1536    18637    19725    20659    39744    73196   101971   151967  125482  151584
    16380     6026     6764     8179    14740    28176    54888    58802   57175   61785
   131070     2034     3088     5019    10004    19712    38982    40418   52960   61816
   393210     2033     3099     4303    10048    19856    39126    40572   57642   53405
   786420     2068     3092     5050    10077    19819    39096    40628
  1572840     2032     2858     4348     9412    19851    39157    39699

 8 Threads
        6    10245    11452    24238    46432    91436    85135   216659  151955  278208
       24    42877    46747    90912    92228   124711   142776   283743  150852  298146
       96    36838    44259    43458    80107   122566   136226   193969  138749  276197
      384    23488    22078    28973    53186    85603   138786   176651  122014  206507
      768    21820    25828    27393    38557    79105   149178   190188   95956  162380
     1536    20182    21804    25304    40594    72493   120503   155289  112027  177947
    16380     6786     7686     9822    19679    35524    59894    73745   64625   65317
   131070     3015     3832     4361     9619    19162    39564    38654   47164   46280
   393210     2390     3176     4901     9995    19884    39652    42583   50841   51818
   786420     2300     3045     4821    10165    19444    38217    38839
  1572840     2032     2992     4792     9680    19259    38238    38778  



RandMem MP Benchmark

The program uses the same code for serial and random access, via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from all caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points. All indexing and arithmetic is carried out using 32 bit integers, leading to 64 bit and 32 bit compilations producing the same performance, subject to variations caused by short running times. Only 64 bit results are provided below.
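The idea of using the same code for serial and random access can be illustrated with an index array that is either in order or shuffled. This is a much simplified stand-in for the benchmark's indexing structure, whose details are not given here; the function names, the seed and the summing operation are all assumptions made for the example.

```python
import random


def build_index(n, randomised, seed=1):
    """Index array: 0..n-1 in order for serial access, shuffled for random access."""
    idx = list(range(n))
    if randomised:
        random.Random(seed).shuffle(idx)
    return idx


def read_test(data, idx, start=0):
    """Same loop for both cases; only the contents of idx differ."""
    n = len(idx)
    total = 0
    for k in range(n):
        # Keep arithmetic within 32 bits, as the report says the benchmark does.
        total = (total + data[idx[(start + k) % n]]) & 0xFFFFFFFF
    return total


data = list(range(1000))
serial = build_index(len(data), randomised=False)
scattered = build_index(len(data), randomised=True)

print(read_test(data, serial))                 # serial read
print(read_test(data, scattered))              # random read, same words visited
print(read_test(data, scattered, start=250))   # another thread's starting point
```

Every word is visited exactly once in each case, so the results match; only the order of memory accesses, and hence the cache and burst behaviour, changes.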

First Windows Versions (obsolete) - RandMP32.exe and RandMP64.exe are also in dualcore.zip with source code in newsource.zip and further details in randmem results.htm.

Later Windows Versions (1 to 8 threads) - Rand8Thread32.exe and Rand8Thread64.exe are available in quadcore.zip, with further details included in quad core 8 thread.htm.

Linux Versions - MPrandmem32 and MPrandmem64 can be found in linux_multithreading_apps.tar.gz. They have the same run time format as the above Linux benchmarks for up to 64 threads. Further details can be found in linux multithreading benchmarks.htm.

The Linux benchmark has additional Mutex tests that restrict updating access to one thread at a time. The effect appears to produce some faster speeds with cached data but slower from RAM. With the other procedures, multithreading performance gains and losses are different between the Windows and Linux compilations.
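Restricting updates to one thread at a time can be sketched with a lock around each read/write step. How the Linux benchmark applies its mutex is not described here, so the granularity below (one lock acquisition per word) and all the names are assumptions for illustration.

```python
import threading


def update_all(data, lock):
    """Read/write every word, holding the lock for each update."""
    for i in range(len(data)):
        with lock:
            data[i] = (data[i] + 1) & 0xFFFFFFFF


def run_mutex_test(n_words, threads):
    """All threads update the same array, one thread at a time."""
    data = [0] * n_words
    lock = threading.Lock()
    workers = [threading.Thread(target=update_all, args=(data, lock))
               for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return data


result = run_mutex_test(256, 4)
print(result[:8])
```

With the lock in place the threads are effectively serialised for updates, which is consistent with the Mutex rows in the Linux log staying close to the one thread speeds.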

                   Windows 1 to 8 Threads - 64 Bit Version

 RandMP 8 Thread Write/Read Test 64 bit Ver. 2.0 Sat May 10 14:38:49 2014
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD    30475   30602   18350   17605   17557   17595   12405   11243
 Serial RW    30195   30175   22013   17576   17469   17482   11531   10642
 Random RD    28916   29109   13124    8290    6319    5669    1308     655
 Random RW    30726   29935    9498    6161    5232    4813    1185     608
 2 Threads Total
 Serial RD    61153   60840   36843   35338   35231   35297   23339   21157
 Serial RW    21994   21510   20967   21670   33037   33428   23256   21508
 Random RD    57862   57902   26154   16611   12484   11248    2622    1302
 Random RW     3761    4658    5132    6599    6963    7114    2399    1282
 4 Threads Total
 Serial RD   116765  120919   73499   62205   70973   70902   45280   41568
 Serial RW    20776   31110   38023   42715   43836   65636   47000   42876
 Random RD   110503  115241   52011   32996   24884   22294    5247    2540
 Random RW     3324    6532    8197   11159   11724   12507    4747    2494
 8 Threads Total
 Serial RD   111370  114213   95358   92240   89120   87557   74104   63754
 Serial RW    28212   37141   54805   64501   56425   72723   70007   49286
 Random RD   108353  110797   59991   41932   32190   14669    4878    2897
 Random RW     5150    8024    9153   17569   15918   13841    4661    2528

                   Linux 1 to 8 Threads - 64 Bit Version

    RandMemMP Speeds 64 Bit Version 1, X Threads, Sun Oct 22 15:00:43 2017
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD    27991   27801   20258   19249   19249   19294   12477   11683
 Serial RW    29969   30241   21896   17829   17494   17499   12085   11565
 Random RD    27484   27463   13589    8257    6220    5604    2471    1011
 Random RW    30364   30075    9168    6108    5177    4783    2804     982
 Mutex SRW    29982   30245   21897   17762   17433   17432   12130   11529
 Mutex RRW    30361   30071    9176    6108    5175    4782    2772     982
 2 Threads
 Serial RD    40622   55523   40299   38028   37866   37878   23094   22142
 Serial RW    14539   21855   20979   22448   31456   25642   24743   18109
 Random RD    40316   54307   26840   16365   12340   11092    4747    1913
 Random RW     3039    4599    5107    6570    6943    7115    4904    1773
 Mutex SRW    15294   29770   21777   17761   17385   17130   12099   11298
 Mutex RRW    22396   29829    9251    6098    5174    4779    2817     970
 4 Threads
 Serial RD    39300  106376   80250   75904   75310   75408   43206   37738
 Serial RW    15182   31547   35603   38859   45426   60180   48848   20287
 Random RD    72790  104282   52951   31312   12640   21975    6813    3317
 Random RW     2582    5910    8171   11159    9140   12510    9591    3261
 Mutex SRW    20566   29383   21517   18150   16703   16945   11798   11177
 Mutex RRW    22006   29629    8880    5881    5035    4666    2702     967
 8 Threads
 Serial RD    37987   76974   96575   94809   88112   88170   66556   60949
 Serial RW     9030   29524   52796   47811   52557   69516   68200   25318
 Random RD    37120   76419   65662   32215   24619   22463   13226    3346
 Random RW     2013    6036    9032   17133   16426   15039   11082    2829
 Mutex SRW     8207   17043   20147   17135   16675   16621   11714   10827
 Mutex RRW     9865   20828    8613    5574    4889    4567    2676     951



MP MFLOPS Benchmarks

The benchmarks carry out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, using data sizes of 0.1, 1.02 and 10.2 million words. Each thread deals with separate segments of the data, via shared code. Both 32 bit and 64 bit versions have been produced, with results in single precision MFLOPS.
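The 8 operation form of the calculation, with each thread handling its own segment of the array through shared code, might look like the sketch below. The constant values, the function names and the way segments are divided are hypothetical choices for illustration, and Python threads will not show the speed gains that the compiled benchmarks do.

```python
import threading

# Hypothetical constants; the benchmark's actual values are not given here.
A, B, C, D, E, F = 0.000020, 0.999980, 0.000011, 1.000011, 0.000012, 0.999992


def triad8(x, lo, hi):
    """8 operations per word: 3 adds, 3 multiplies, 1 subtract, 1 add."""
    for i in range(lo, hi):
        x[i] = (x[i] + A) * B - (x[i] + C) * D + (x[i] + E) * F


def run_segments(x, threads):
    """Shared code, separate data: thread t updates only its own slice of x."""
    n = len(x)
    workers = []
    for t in range(threads):
        lo, hi = t * n // threads, (t + 1) * n // threads
        worker = threading.Thread(target=triad8, args=(x, lo, hi))
        workers.append(worker)
        worker.start()
    for worker in workers:
        worker.join()


single = [1.0] * 1000
triad8(single, 0, len(single))

multi = [1.0] * 1000
run_segments(multi, 4)
print(multi == single)
```

Since no two threads touch the same word, no locking is needed and the threaded result matches the single threaded one.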

Windows Versions - MPmflops32.exe using 32 bit instructions, MPmflops64.exe with SSE instructions, MPmflopsc2.exe a later 64 bit SSE compilation for full SIMD operation and MPmflopsAVX.exe a 64 bit compilation using /arch:AVX option. The benchmarks and source code are available in gigaflops-benchmarks.zip, with further details and results in GigaFLOPS Benchmarks.htm All were compiled from the same code to handle up to 64 threads (Command Format Example - MPmflopsc2 Threads 8).

Linux Versions - MPmflops32, MPmflops32SSE and MPmflops64, where benchmarks and source code are in linux_multithreading_apps.tar.gz, again for up to 64 threads. Further details and results can be found in linux multithreading benchmarks.htm. MPmflops64AVX was produced later and is in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.

Results for runs on Windows and Linux are below. The first Windows results are from a compilation for old i87 32 bit floating point. The second had a compiler directive to use SSE functions, but only achieved Single Instruction Single Data (SISD) operation, using one word out of the 4 word registers, and was slightly faster than the i87 version during the early tests. The third, from the later 64 bit compilation, achieved full SIMD operation. The fourth, with an AVX compiler directive, generated the appropriate vector instructions, but applied to SSE 128 bit registers, to produce the same performance as the SSE tests.

Maximum SSE MFLOPS per core are equal to CPU MHz x 4 (single precision words per 128 bit SSE register) x 2 (linked multiply and add), or 31.2 GFLOPS for the 3.9 GHz Core i7 considered here, giving 124.8 GFLOPS for four cores. The 256 bit AVX registers double this, to 249.6 GFLOPS. Both Windows and Linux programs demonstrated respectable performance of more than 90 GFLOPS for SSE, with the Linux benchmark reaching near 180 GFLOPS using AVX instructions.
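
The peak calculation above can be reproduced with a few lines of Python, given here as a worked check rather than as part of the benchmarks. The function and parameter names are invented for the example. The measured figures are the best SSE and AVX results from the 8 thread rows of the tables in this section.

```python
def peak_mflops(mhz, register_bits, cores=1, word_bits=32, ops_per_cycle=2):
    """Theoretical peak: MHz x words per register x ops per cycle x cores."""
    words_per_register = register_bits // word_bits
    return mhz * words_per_register * ops_per_cycle * cores


def percent_of_peak(measured_mflops, peak):
    """Measured speed as a percentage of the theoretical maximum."""
    return 100.0 * measured_mflops / peak


sse_core = peak_mflops(3900, 128)           # one core, 128 bit SSE
sse_cpu = peak_mflops(3900, 128, cores=4)   # four cores, 128 bit SSE
avx_cpu = peak_mflops(3900, 256, cores=4)   # four cores, 256 bit AVX

print(sse_core, sse_cpu, avx_cpu)           # values in MFLOPS

# Best measured MFLOPS from the tables in this section.
print(round(percent_of_peak(98446, sse_cpu), 1))   # Windows SIMD, 8 threads
print(round(percent_of_peak(177831, avx_cpu), 1))  # Linux SIMD AVX, 8 threads
```

On these figures, the best SSE result is somewhat under 80% of the four core maximum and the best AVX result is a little over 70%.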

                                Windows MFLOPS 1 to 16 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads
 Core i7 4820K    1   3867   3853   3386    6085   6054   6017    5830   5824   5809
 256 KB x 4 L2    2   7737   7731   6618   12160  12165  11991   11653  11648  11650
 4 core 8 Thrd    4  15433  15459   9833   23487  24291  23886   22666  23175  23220
 3900 MHz i87     8  15359  15395   9846   23554  23708  23586   23418  23464  23416
 Windows i87     16  15145  15192  10023   23422  23536  22966   23241  23401  23282

 Core i7 4820K    1   5004   4960   4192    6188   6182   6135    5890   5890   5887
 256 KB x 4 L2    2   9996  10002   8049   12371  12354  12282   11770  11779  11744
 4 core 8 Thrd    4  19923  18532   9866   23946  24704  24347   23219  23531  23497
 3900 MHz         8  19602  19776   9820   24683  24648  24634   23521  23497  23506
 Windows SISD    16  18727  19077  10073   24316  24243  24442   23469  23393  23385

 Core i7 4820K    1  10116   9864   5852   24636  24436  19881   23353  23389  23243
 256 KB x 4 L2    2  26453  19851   9189   49181  49223  34969   46653  46759  46414
 4 core 8 Thrd    4  41845  26975  10063   85909  93852  40163   89202  90572  87329
 3900 MHz         8  58734  43723   9980   97139  98446  40062   91320  93885  93125
 Windows SIMD    16  57731  42194  10178   94166  93338  40074   90162  92102  93496

 Core i7 4820K    1  10046   9901   5906   24629  24382  19832   23411  23361  23246
 256 KB x 4 L2    2  26634  19679   9250   49194  49267  35183   46788  46788  46382
 4 core 8 Thrd    4  52424  39057  10092   60266  98220  39744   90948  90611  92515
 3900 MHz         8  58601  43529  10032   85198  98220  40162   93810  93866  93745
 Windows AVX 1   16  57098  42920  10319   86267  95243  40427   92929  92995  92356

                               Linux MFLOPS 1 to 8 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads
 Core i7 64 bit   1   9681   9759   5990   24533  24570  19975   23269  23307  23052
 4820K            4  45340  21688   9237   49320  49918  36638   46942  89676  91029
 Linux SIMD SSE   8  54621  41832  10026   92086  92352  39982   92408  93282  92050

 Core i7 64 bit   1  12542  11404   5991   35982  36180  23299   46400  46572  44729
 4820K            4  62273  23031   8970  159040  80096  40124   90572  91058  88877
 Linux SIMD AVX   8  60258  44329   9977  173224 151909  40153  173372 177831 158594
  



OpenMP and QPAR Benchmarks

The benchmarks carry out the same calculations as the MP MFLOPS Benchmarks above, but without explicit multithreading code. Instead, the main loops are preceded by a "#pragma omp parallel for" directive and, in some cases, a compile parameter is also used.

Windows OpenMP Benchmarks - OpenMP32MFLOPS.exe, SSE32MFLOPS.exe (same code, no OpenMP directives) and OpenMP64MFLOPS.exe are included in openmpmflops.zip. Further details and results are in openmp mflops.htm. Different OpenMP benchmarks are covered in openmp speeds.htm.

With Visual Studio 2012, Microsoft added the Auto-Parallelizer (QPAR) to the compiler, which can automatically generate multiple threads in the same way as OpenMP. The benchmark QparMP64MFLOPS.exe was produced, with execution and source files included in gigaflops-benchmarks.zip, and details and results in GigaFLOPS Benchmarks.htm and quad core 8 thread.htm.

Linux Original Versions - openMPmflops32, openMPmflops64, notOMPmflops32 and notOMPmflops64, from linux openmp.tar.gz with details in linux openmp benchmarks.htm. Then there are Later Versions - openMPmflops64, notOMPmflops64 and openMPmflops64AVX in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.

Results below are again from benchmarking the 3.9 GHz Core i7.

Windows OpenMP64MFLOPS.exe provides similar speeds to 64 bit MP-MFLOPS SISD at 32 operations per word; otherwise it is slower.

QparMP64MFLOPS.exe obtains similar 4 thread performance to MP-MFLOPS SIMD. QPAR appears to provide a better alternative than OpenMP but, overall, hand coded multithreading seems to be the best option.

Linux notOMPmflops64 V1 and V2 achieve similar speeds to the single thread MP-MFLOPS benchmark, but not when compared with the 4 thread test, and particularly the one using 8 threads.

openMPmflops64AVX performance is generally inferior to that from Linux MPmflops64AVX.
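
The comparisons above can be checked with simple ratios of the MFLOPS figures in the tables. The short Python sketch below is a worked example only. The function names are invented for it, and the rows chosen (1.02 million words) are one possible selection, so other columns will give different ratios.

```python
def speedup(multi_thread, single_thread):
    """How many times faster the multithreaded run is."""
    return multi_thread / single_thread


def relative_speed(candidate, reference):
    """Candidate speed as a fraction of the reference speed."""
    return candidate / reference


# MFLOPS at 8 operations per word, 1.02 million words, from the tables.
qpar_1_thread = 23126       # QparMP64MFLOPS.exe, 1 thread
qpar_4_threads = 91259      # QparMP64MFLOPS.exe, 4 threads
mp_simd_4_threads = 93852   # Windows MP-MFLOPS SIMD, 4 threads

# MFLOPS at 32 operations per word, 1.02 million words, from the tables.
openmp_avx = 23133          # openMPmflops64AVX
mp_avx_8_threads = 177831   # Linux MPmflops64AVX, 8 threads

print(f"QPAR 4 thread speedup:   {speedup(qpar_4_threads, qpar_1_thread):.2f}")
print(f"QPAR vs MP-MFLOPS SIMD:  {relative_speed(qpar_4_threads, mp_simd_4_threads):.2f}")
print(f"OpenMP AVX vs MP AVX:    {relative_speed(openmp_avx, mp_avx_8_threads):.2f}")
```

For these rows, QPAR gains close to a factor of four from four threads and comes within a few percent of the hand coded SIMD version, while openMPmflops64AVX at 32 operations per word reaches only a small fraction of the MPmflops64AVX 8 thread speed.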


 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24

 Windows

 SSE32MFLOPS.exe      4898   4845   4171    5824   5994   6094    5796   5829   5795
 OpenMP32MFLOPS.exe   6511   9290   9119   14351  17324  17592   21454  22884  22850

 OpenMP64MFLOPS.exe   8420  12440   9483   18477  23210  23737   22134  18281  19690

 QparMP64MFLOPS.exe 
 1 Thread             9691   9454   5743   23214  23126  19033   22700  23541  23405
 2 Threads           23972  18673   9177   44855  44919  33868   44070  45733  46419
 4 Threads           43356  36007  10084   76380  91259  40349   85300  81803  69212
 8 Threads           44741  33966   9732   81506  73857  36635   87736  91170  87086

 Linux

 notOMPmflops64 V1   10093   9803   5919   24634  24651  20097   23519  23520  23339
 openMPmflops64 V1    9084  12363   8089   22273  23039  22432   22683  23195  23096

 notOMPmflops32 V2    3884   3886   3612    6145   6151   6067    5837   5835   5830
 openMPmflops32 V2    9483  12481   8628   22347  23032  22742   22691  23247  23126

 notOMPmflops64 V2    9879   9772   5934   24500  24529  20039   23285  23290  23090
 openMPmflops64 V2   11163  20322   9180   45392  49695  33927   21534  22477  22476

 openMPmflops64AVX   19713  37822   9219   94036  68725  36923   22761  23133  23019
  
